Objectives

In this project, we aim to build a movie recommendation system for customers using the data set that Netflix released for its Netflix Prize competition, which contains ratings recorded through 2005. We apply suitable data pre-processing, exploration, and visualization steps to help readers follow the workflow of the project, which leads to the development of our models, accompanied by error analysis and suggestions for improvement.

Libraries

First, we load the packages needed for data manipulation, plotting, string interpolation, and label formatting.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(glue)
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

Data Reading

In this section, we read the data into our R session (we work in RStudio), give the columns descriptive names, and convert them to suitable types.

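# `data` is assumed to be already loaded at this point, with the default
# column names V1 to V4; the import call itself is not shown in this report.
data <- data %>% 
  rename(movie_id = V1,
         customer_id = V2,
         rating = V3,
         date = V4) %>% 
  mutate(date = as.Date(date),          # parse the date strings
         rating = as.factor(rating))    # treat ratings as categories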

data %>% head()
dim(data)
## [1] 100480507         4
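
The renaming step above relies on `data` having the column names V1 to V4, which are the defaults R assigns when a file without a header row is read. As a minimal sketch of that pattern, the following base R example reads a tiny temporary file and applies the same renaming and type conversions. The file, its three made-up rows, and the names `tmp`, `toy`, and `default_names` are illustrative assumptions, not the project's actual input.

```r
# Hypothetical toy input: four headerless columns (movie, customer, rating, date).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,101,3,2005-09-06",
             "1,102,5,2005-05-13",
             "2,103,4,2005-10-19"), tmp)

# header = FALSE makes R assign the default names V1, V2, V3, V4.
toy <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)
default_names <- names(toy)

# Rename to descriptive names and convert types, mirroring the step above.
names(toy) <- c("movie_id", "customer_id", "rating", "date")
toy$date   <- as.Date(toy$date)
toy$rating <- factor(toy$rating, levels = 1:5)

str(toy)
unlink(tmp)
```

Fixing the factor levels to 1 through 5 keeps all rating categories available even when a small sample does not contain every value.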

Data Pre-processing

Data Inspection

# Count how many times each rating value occurs
data_summary <- data %>% 
  count(rating)
data_summary
levels(data_summary$rating)
## [1] "1" "2" "3" "4" "5"
# Formatter for values that are already on a 0-100 scale (e.g. 23.5 -> "23.50%")
fmt_percent <- label_percent(accuracy = 0.01, scale = 1)
data_summary <- data_summary %>% 
  ungroup() %>%                       # make sure sum(n) spans all ratings
  drop_na() %>%                       # drop rows with missing values
  mutate(prob = n / sum(n) * 100,     # share of each rating, in percent
         tooltip = glue("distribution: {fmt_percent(prob)}"))
data_summary
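
To make the percentage calculation concrete, here is a small base R sketch of the same idea on ten made-up ratings. The vector `ratings` and the data frame `toy_summary` are hypothetical names used only for this illustration.

```r
# Toy ratings (made up for illustration), stored as a factor with levels 1-5.
ratings <- factor(c(5, 4, 4, 3, 5, 1, 4, 2, 3, 4), levels = 1:5)

# Count each rating level, then convert the counts to percentages of the total.
counts <- table(ratings)
toy_summary <- data.frame(
  rating = factor(names(counts), levels = levels(ratings)),
  n      = as.integer(counts)
)
toy_summary$prob <- toy_summary$n / sum(toy_summary$n) * 100

toy_summary
```

Because every count is divided by the same total, the `prob` column always sums to 100, which is a quick sanity check for the real summary as well.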

Data Visualization

# Totals shown in the subtitle
n_movies    <- n_distinct(data$movie_id)
n_customers <- n_distinct(data$customer_id)
n_ratings   <- nrow(data)

data_sum_graph <- ggplot(data_summary, aes(x = rating, y = prob, text = tooltip, fill = prob)) +
  geom_col(position = "identity") +
  labs(title = "Probability Distribution of Movie Ratings",
       subtitle = glue("For data set 1 ({comma(n_movies)} movies, {comma(n_customers)} customers, and {comma(n_ratings)} ratings)"),
       x = "Movie rating",
       y = "Probability (%)") +
  scale_fill_gradient(low = "#e4333e", high = "#52171a") +
  theme_minimal() 
# Convert to an interactive plot that shows only the custom tooltip on hover
ggplotly(data_sum_graph, tooltip = c("text"))

Model Training

Collaborative Filtering

Pearson’s Correlation (r)
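
The report does not show its implementation at this point, so the following is only a minimal sketch of how Pearson's correlation is commonly used as a similarity measure in user-based collaborative filtering: two customers are compared on the movies both have rated. The toy matrix `ratings`, the customer and movie labels, and the helper `pearson_sim` are all hypothetical and made up for illustration.

```r
# Toy rating matrix (made up): rows are customers, columns are movies,
# NA means the customer has not rated that movie.
ratings <- matrix(c(5, 3, 4, NA,
                    4, 2, 5, 1,
                    1, 5, NA, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"),
                                  c("m1", "m2", "m3", "m4")))

# Pearson correlation between two customers, using only co-rated movies.
pearson_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (sum(both) < 2) return(NA_real_)  # too few co-rated movies to correlate
  cor(a[both], b[both], method = "pearson")
}

sim_u1_u2 <- pearson_sim(ratings["u1", ], ratings["u2", ])
sim_u1_u3 <- pearson_sim(ratings["u1", ], ratings["u3", ])
```

A value near 1 suggests that two customers rate movies similarly, while a value near -1 suggests opposite tastes. For a full customer-by-customer similarity matrix on a small example like this, `cor(t(ratings), use = "pairwise.complete.obs")` computes the same quantity for every pair at once.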